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Abstract 



As deep web grows at a very fast pace, there has been increased interest in 
techniques that help efficiently locate deep-web interfaces. However, due to the 
large volume of web resources and the dynamic nature of deep web, achieving 
wide coverage and high efficiency is a challenging issue. We propose a two-stage 
framework, namely Smart Crawler, for efficient harvesting deep web interfaces. 
In the first stage, Smart Crawler performs site-based searching for centre pages 
with the help of search engines, avoiding visiting a large number of pages. To 
achieve more accurate results for a focused crawl, Smart Crawler ranks websites 
to prioritize highly relevant ones for a given topic. In the second stage, Smart 
Crawler achieves fast in-site searching by excavating most relevant links with an 
adaptive link-ranking. To eliminate bias on visiting some highly relevant links in 
hidden web directories, we design a link tree data structure to achieve wider 
coverage for a website. Our experimental results on a set of representative 
domains show the agility and accuracy of our proposed crawler framework, 
which efficiently retrieves deep-web interfaces from large-scale sites and achieves 
higher harvest rates than other crawlers. 



1. Introduction 



It is challenging to locate the deep web databases, because they are not registered with any search 
engines, are usually sparsely distributed, and keep constantly changing. To address this problem, 
previous work has proposed two types of crawlers, generic crawlers and focused crawlers. Generic 
crawlers fetch all searchable forms and cannot focus on a specific topic. Focused crawlers such as 
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Form-Focused Crawler (FFC) and Adaptive Crawler for Hidden-web Entries (ACHE) can 
automatically search online databases on a specific topic. FFC is designed with link, page, and form 
classifiers for focused crawling of web forms, and is extended by ACHE with additional 
components for form filtering and adaptive link learner. The link classifiers in these crawlers play a 
pivotal role in achieving higher crawling efficiency than the best-first crawler. However, these link 
classifiers are used to predict the distance to the page containing searchable forms, which is 
difficult to estimate, especially for the delayed benefit links (links eventually lead to pages with 
forms). As a result, the crawler can be inefficiently led to pages without targeted forms. 



2. Overview Of Existing System 



The existing system is a manual or semi-automated system, i.e. The Textile Management 
System is the system that can directly sent to the shop and will purchase clothes whatever 
you wanted. The users are purchase dresses for festivals or by their need. They can spend 
time to purchase this by their choice like color, size, and designs, rate and so on. They But 
now in the world everyone is busy. They don’t need time to spend for this. Because they 
can spend whole the day to purchase for their whole family. So we proposed the new 
system for web crawling. Few disadvantages of existing system are: 

i. ) Consuming large amount of data’s. 

ii. ) Time wasting while crawl in the web. 



3. Proposed System 



We propose a two-stage framework, namely Smart Crawler, for efficient harvesting deep 
web interfaces. In the first stage, Smart Crawler performs site-based searching for center 
pages with the help of search engines, avoiding visiting a large number of pages. To 
achieve more accurate results for a focused crawl, Smart Crawler ranks websites to 
prioritize highly relevant ones for a given topic. In the second stage, Smart Crawler 
achieves fast in- site searching by excavating most relevant links with an adaptive link- 
ranking. To eliminate bias on visiting some highly relevant links in hidden web directories, 
we design a link tree data structure to achieve wider coverage for a website. Our 
experimental results on a set of representative domains show the agility and accuracy of 
our proposed crawler framework, which efficiently retrieves deep-web interfaces from 
large-scale sites and achieves higher harvest rates than other crawlers. By propose an 
effective harvesting framework for deep-web interfaces, namely Smart-Crawler we have 
shown that our approach achieves both wide coverage for deep web interfaces and 
maintains highly efficient crawling. Smart Crawler is a focused crawler consisting of two 
stages: efficient site locating and balanced in-site exploring. Smart Crawler performs site- 
based locating by reversely searching the known deep web sites for center pages, which can 
effectively find many data sources for sparse domains. By ranking collected sites and by 
focusing the crawling on a topic, Smart Crawler achieves more accurate results. Main 
advantages of proposed system are: 

i.) A novel two-stage framework to address the problem of searching for hidden-web resources. 
Our site locating technique employs a reverse searching technique (e.g., using Google’s 
’’link:” facility to get pages pointing to a given link) and incremental two-level site 
prioritizing technique for unearthing relevant sites, achieving more data sources. During the 
in-site exploring stage, we design a link tree for balanced link prioritizing, eliminating bias 
toward web pages in popular directories. 
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ii.) An adaptive learning algorithm that performs online feature selection and uses these features 
to automatically construct link rankers. In the site locating stage, high relevant sites are 
prioritized and the crawling is focused on a topic using the contents of the root page of sites, 
achieving more accurate results. During the insight exploring stage, relevant links are 
prioritized for fast in-site searching. 
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Figure 1: Showing System Architecture 



4. Literature Survey 



Literature survey is the most important step in software development process. Before developing 
the tool, it is necessary to determine the time factor, economy and company strength. 

• Denis Shestakov & Tapio Salakoski: Host-ip clustering technique for deep web 
characterization. This paper presents a huge portion of today’s Web consists of web pages 
filled with information from myriads of online databases. This part of the Web, known as the 
deep Web, is to date relatively unexplored and even major characteristics such as number of 
searchable databases on the Web is somewhat disputable. In this paper, we are aimed at more 
accurate estimation of main parameters of the deep Web by sampling one national web domain. 
We propose the Host-IP clustering sampling technique that addresses drawbacks of existing 
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approaches to characterize the deep Web and report our findings based on the survey of Russian 
Web conducted in September 2006. Obtained estimates together with a proposed sampling 
method could be useful for further studies to handle data in the deep Web. [13] 

• Luciano Barbosa and Juliana Freire : Searching for hidden-web databases. Recently, there 
has been increased interest in the retrieval and integration of hidden Web data with a view to 
leverage high-quality information available in online databases. Although previous works have 
addressed many aspects of the actual integration, including matching form schemata and 
automatically filling out forms, the problem of locating relevant data sources has been largely 
overlooked. Given the dynamic nature of the Web, where data sources are constantly changing, 
it is crucial to automatically discover these resources. However, considering the number of 
documents on the Web (Google already indexes over 8 billion documents), automatically 
finding tens, hundreds or even thousands of forms that are relevant to the integration task is 
really like looking for a few needles in a haystack. Besides, since the vocabulary and structure 
of forms for a given domain are unknown until the forms are actually found, it is hard to define 
exactly what to look for. We propose a new crawling strategy to automatically locate hidden- 
Web databases which aims to achieve a balance between the two conflicting requirements of 
this problem: The need to perform a broad search while at the same time avoiding the need to 
crawler large number of irrelevant pages. The proposed strategy does that by focusing the crawl 
on a given topic; by judiciously choosing links to follow within a topic that are more likely to 
lead to pages that contain forms; and by employing appropriate stopping criteria. We describe 
the algorithms underlying this strategy and an experimental evaluation which shows that our 
approach is both effective and efficient, leading to larger numbers of forms retrieved as a 
function of the number of pages visited than other crawlers. [16] 

• Andre Bergholz and Boris Childlovskii: Crawling for domain specific hidden web resources. 
In this study authors have explain the Hidden Web, the part of the Web that remains 
unavailable for standard crawlers, has become an important research topic during recent years. 
Its size is estimated to 400 to 500 times larger than that of the publicly indexable Web (PIW). 
Furthermore, the information on the hidden Web is assumed to be more structured, because it is 
usually stored in databases. In this paper, we describe a crawler which starting from the PIW 
finds entry points into the hidden Web. The crawler is domain-specific and is initialized with 
pre-classified documents and relevant keywords. We describe our approach to the automatic 
identification of Hidden Web resources among encountered HTML forms. We conduct a series 
of experiments using the top-level categories in the Google directory and report our analysis of 
the discovered Hidden Web resources. 

• Martin Hilbert: How to Measure How Much Information?ln this innovative study author 
have represented the modern-day fascination of social scientists with inventories of social 
information and communication goes at least back to Machlup’s ground-breaking work. 
Following the logic of national accounting in economics, Machlup identified those sectors of 
the economy that he (quite subjectively) considered to be information- and knowledge- 
intensive and tracked the size of the respective industries (in US dollars) and occupational 
force. Following Machlups lead, Porat (1977) evolved this approach. He famously concluded 
that the value of the composed labor and capital resources of these information sectors made up 
25 percentage of U.S. gross domestic product in 1967. This estimate is based on a rather 
subjective identification of information capital and information workers. He measures the 
economic value of the related information activity [which] includes all the resources consumed 
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in producing, processing, and distributing information goods and services (p. 2). As information 
capital he loosely identified a wide variety of information capital resources [which] are used to 
deliver the informational requirements of one firm: typewriters, calculators, copiers, terminals, 
computers, telephones and switchboards . . . microwave antennae, satellite dishes and facsimile 
machines (pp. 23). Despite all coarse-graining and methodological arbitrariness of this 
approach to represent the role of information in an economy, Machlup and Porats work 
constitute important milestones with regard to evaluating the economic dimensions of 
information in a society. [1] 

• Yeye Hey, Dong Xin, Venkatesh Ganti,Sriram Rajaraman,Nirav Shah:Crawling Deep Web 
Entity Pages?. This paper showed that Deep- web crawl is concerned with the problem of 
surfacing hidden content behind search interfaces on the Web. While many deep- web sites 
maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has 
traditionally been the focus of the deep-web literature, we observe that a significant portion of 
deep- web sites, including almost all online shopping sites, curate structured entities as opposed 
to text documents. Although crawling such entity-oriented content is clearly useful for a variety 
of purposes, existing crawling techniques optimized for document oriented content are not best 
suited for entity-oriented sites. In this work, we describe a prototype system we have built that 
specializes in crawling entity-oriented deep-Web sites. We propose techniques tailored to tackle 
important sub problems including query generation, empty page filtering and URL duplication 
in the specific context of entity oriented deep-web sites. These techniques are experimentally 
evaluated and shown to be effective. [7] 



5. Conclusion And Future Scope 



As profound web develops at a quick pace, there has been expanded enthusiasm for methods that 
assist proficiently with finding profound web interfaces. Nonetheless, because of the extensive 
volume of web assets and the dynamic way of profound web, accomplishing wide scope and high 
productivity is a testing issue. We propose a two- stage structure, in particular Smart Crawler, for 
effective gathering profound web interfaces. In the first stage, Smart Crawler performs site -based 
hunting down focus pages with the assistance of web indexes, abstaining from going by countless. 
In this article we have seen two basic stages for smart crawling web harvesting system. In this 
article also we have state Site locating and in Site exploring. As web crawlers becomes increasingly 
popular, efficient searching of web pages , such as page ranking to compare most searched pages 
and web sites for efficient and fast also accurate data finding is become essay. The performance of 
these operations directly affects the usability of the benefits offered by smart web crawlers. 
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